Abstract

We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic: informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. [2].

We compare to the adversarial nets framework introduced by Goodfellow et al. [1], in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD.

1 Introduction 

In this paper, we consider the problem of learning generative models from i.i.d. data with unknown distribution $\mathcal{P}$. We formulate the learning problem as one of finding a function $G$, called the generator, such that, given an input $Z$ drawn from some fixed noise distribution $\mathcal{N}$, the distribution of the output $G(Z)$ is close to the data's distribution $\mathcal{P}$. Note that, given $G$ and $\mathcal{N}$, we can easily generate new samples despite not having an explicit representation for the underlying density.

We are particularly interested in the case where the generator is a deep neural network whose parameters we must learn. Rather than being used to classify or predict, these networks transport input randomness to output randomness, thus inducing a distribution. The first direct instantiation of this idea is due to MacKay [7], although MacKay draws connections even further back to the work of Saund and others on autoencoders, suggesting that generators can be understood as decoders. MacKay's proposal, called density networks, uses multi-layer perceptrons (MLP) as generators and learns the parameters by approximating Bayesian inference.

Since MacKay's proposal, there has been a great deal of progress on learning generative models, especially over high-dimensional spaces like images. Some of the most successful approaches have been based on restricted Boltzmann machines
m and deep Boltzmann networks (3). A recent example is the 
Neural Autoregressive Density Estimator due to Uria, Murray, and Larochelle. An in-depth survey, however, is beyond the scope of this article.

This work builds on a proposal due to Goodfellow et al. [1]. Their adversarial nets framework takes an indirect approach to learning deep generative neural networks: a discriminator network is trained to recognize the difference between training data and generated samples, while the generator is trained to confuse the discriminator. The resulting two-player game is cast as a minimax optimization of a differentiable objective and solved greedily by iteratively performing gradient descent steps to improve the generator and then the discriminator.

Given the greedy nature of the algorithm, Goodfellow et al. [1] give a careful prescription for balancing the training of the generator and the discriminator. In particular, two gradient steps on the discriminator's parameters are taken for every iteration of the generator's parameters. It is not clear at this point how sensitive this balance is as the data set and network vary. In this paper, we describe an approximation to adversarial learning that replaces the adversary with a closed-form nonparametric two-sample test statistic based on the Maximum Mean Discrepancy (MMD), which we adopted from the kernel two-sample test [2]. We call our proposal MMD nets.¹ We give bounds on the estimation error incurred by optimizing an empirical estimator rather than the true population MMD and give some illustrations on synthetic and real data.

2 Learning to sample as optimization 

It is well known that, for any distribution $\mathcal{P}$ and any continuous distribution $\mathcal{N}$ on sufficiently regular spaces $\mathbb{X}$ and $\mathbb{W}$, respectively, there is a function $G : \mathbb{W} \to \mathbb{X}$ such that $G(W) \sim \mathcal{P}$ when $W \sim \mathcal{N}$. (See, e.g., 0] Lem. 3.22].) In other words, we can transform an input from a fixed input distribution $\mathcal{N}$ through a deterministic function, producing an output whose distribution is $\mathcal{P}$. For a given family $\{G_\theta\}$ of functions $\mathbb{W} \to \mathbb{X}$, called generators, we can cast the problem of learning a generative model

¹ In independent work reported in a recent preprint, Li, Swersky, and Zemel [6] also propose to use MMD as a training objective for generative neural networks. We leave a comparison to future work.
Figure 1: (top left) Comparison of adversarial nets and MMD nets. (top right) Here we present a simple one-dimensional illustration of optimizing a generator via MMD. Both the training data and noise data are Gaussian distributed and we consider the class of generators given by $G_\theta(w) = \mu + \sigma w$. The plot on the left shows the isocontours of the MMD-based cost function and the path taken by gradient descent. On the right, we show the distribution of the generator before and after a number of training iterations, as compared with the data generating distribution. Here we did not resample the generated points and so we do not expect to be able to drive the MMD to zero and match the distribution exactly. (bottom) The same procedure is repeated here for a two-dimensional dataset. On the left, we see the gradual alignment of the Gaussian-distributed input data to the Gaussian-distributed output data as the parameters of the generator $G_\theta$ are optimized. The learning curve on the right shows the decrease in MMD obtained via gradient descent.


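as an optimization

$$\arg\min_{\theta} \; \delta(\mathcal{P}, G_\theta(\mathcal{N})), \qquad (1)$$

where $\delta$ is some measure of discrepancy and $G_\theta(\mathcal{N})$ is the distribution of $G_\theta(W)$ when $W \sim \mathcal{N}$. In practice, we only have i.i.d. samples $X_1, X_2, \ldots$ from $\mathcal{P}$, and so we optimize an empirical estimate of $\delta(\mathcal{P}, G_\theta(\mathcal{N}))$.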

2.1 Adversarial nets 

Adversarial nets [1] can be cast within this framework: Let $\{D_\phi\}$ be a family of functions $\mathbb{X} \to [0, 1]$, called discriminators. We recover the adversarial nets objective with the discrepancy

$$\delta_{\mathrm{AN}}(\mathcal{P}, G_\theta(\mathcal{N})) = \max_{\phi} \; \mathbb{E}\big[\log D_\phi(X) + \log(1 - D_\phi(Y))\big],$$

where $X \sim \mathcal{P}$ and $Y \sim G_\theta(\mathcal{N})$. In this case, Eq. (1) becomes

$$\min_{\theta} \max_{\phi} \; V(G_\theta, D_\phi),$$

where

$$V(G_\theta, D_\phi) = \mathbb{E}\big[\log D_\phi(X) + \log(1 - D_\phi(G_\theta(W)))\big]$$

for $X \sim \mathcal{P}$ and $W \sim \mathcal{N}$. The output of the discriminator $D_\phi$ can be interpreted as the probability it assigns to its input being drawn from $\mathcal{P}$, and so $V(G_\theta, D_\phi)$ is the expected log loss incurred when classifying the origin of a point equally likely to have been drawn from $\mathcal{P}$ or $G_\theta(\mathcal{N})$. Therefore, optimizing $\phi$ maximizes the probability of distinguishing samples from $\mathcal{P}$ and $G_\theta(\mathcal{N})$. Assuming that the optimal discriminator exists for every $\theta$, the optimal generator $G$ is that whose output distribution is closest to $\mathcal{P}$, as measured by the Jensen-Shannon divergence, which is minimized when $G_\theta(\mathcal{N}) = \mathcal{P}$.
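The link between the value function and the Jensen-Shannon divergence can be checked numerically on a small finite sample space. The sketch below is illustrative only: the three-point distributions and all function names are assumptions made for this example, and it relies on the identity $V(G, D^*) = -\log 4 + 2\,\mathrm{JSD}(p \,\|\, q)$ for the optimal discriminator $D^*(x) = p(x)/(p(x) + q(x))$ shown by Goodfellow et al. [1].

```python
import math

def value_function(p_data, p_gen, d):
    """V(G, D) = E_{x~p_data}[log D(x)] + E_{y~p_gen}[log(1 - D(y))] on a finite space."""
    real_term = sum(p * math.log(dx) for p, dx in zip(p_data, d))
    fake_term = sum(q * math.log(1.0 - dx) for q, dx in zip(p_gen, d))
    return real_term + fake_term

def optimal_discriminator(p_data, p_gen):
    """Pointwise maximizer of V: D*(x) = p(x) / (p(x) + q(x))."""
    return [p / (p + q) for p, q in zip(p_data, p_gen)]

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm)."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]   # hypothetical data distribution
    q = [0.2, 0.3, 0.5]   # hypothetical generator distribution
    v_star = value_function(p, q, optimal_discriminator(p, q))
    print(v_star, -math.log(4.0) + 2.0 * jsd(p, q))
```

When the two distributions coincide, the optimal discriminator outputs 1/2 everywhere and the value function equals $-\log 4$, its minimum over generators.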

In [1], the generators $G_\theta$ and discriminators $D_\phi$ are chosen to be multilayer perceptrons (MLP). In order to find a minimax solution, they propose taking alternating gradient steps along $D_\phi$ and $G_\theta$. Note that the composition $D_\phi(G_\theta(\cdot))$ that appears in the value function is yet another (larger) MLP. This fact permits the use of the back-propagation algorithm to take gradient steps.

2.2 MMD as an adversary 

In their paper introducing adversarial nets, Goodfellow et al. [1] remark that a balance must be struck between optimizing the generator and optimizing the discriminator. In particular, the authors suggest $k$ maximization steps for every one minimization step to ensure that $D_\phi$ is well synchronized with $G_\theta$ during training. A large value for $k$, however, can lead to overfitting. In their experiments, for every step taken along the gradient with respect to $G_\theta$, they take two gradient steps with respect to $D_\phi$ to bring $D_\phi$ closer to the desired optimum (Goodfellow, pers. comm.).

It is unclear how sensitive this balance is. Regardless, while adversarial networks deliver impressive sampling performance, the optimization takes approximately 7.5 hours to train on the MNIST dataset running on an NVIDIA GeForce GTX TITAN GPU with 6GB RAM. Can we potentially speed up the process with a more tractable choice of adversary?

Our proposal is to replace the adversary with the kernel two-sample test introduced by Gretton et al. [2]. In particular, we replace the family of discriminators with a family $\mathcal{H}$ of test functions $\mathbb{X} \to \mathbb{R}$, closed under negation, and use the maximum mean discrepancy between $\mathcal{P}$ and $G_\theta(\mathcal{N})$ over $\mathcal{H}$, given by

$$\delta_{\mathrm{MMD}_{\mathcal{H}}}(\mathcal{P}, G_\theta(\mathcal{N})) = \sup_{f \in \mathcal{H}} \; \mathbb{E}[f(X)] - \mathbb{E}[f(Y)], \qquad (2)$$

where $X \sim \mathcal{P}$ and $Y \sim G_\theta(\mathcal{N})$. See Fig. 1 for a comparison of the architectures of adversarial and MMD nets.

While Eq. (2) involves a maximization over a family of functions, Gretton et al. [2] show that it can be solved in closed form when $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS).

More carefully, let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) of real-valued functions on $\Omega$ and let $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denote its inner product. By the reproducing property it follows that there exists a reproducing kernel $k \in \mathcal{H}$ such that every $f \in \mathcal{H}$ can be expressed as

$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}} = \sum_i \alpha_i \, k(x, x_i). \qquad (3)$$

The functions induced by a kernel $k$ are those functions in the closure of the span of the set $\{k(\cdot, x) : x \in \Omega\}$, which is necessarily an RKHS. Note that for every positive definite kernel there is a unique RKHS $\mathcal{H}$ such that every function in $\mathcal{H}$ satisfies Eq. (3).

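Assume that $\mathbb{X}$ is a nonempty compact metric space and $\mathcal{F}$ a class of functions $f : \mathbb{X} \to \mathbb{R}$. Let $p$ and $q$ be Borel probability measures on $\mathbb{X}$, and let $X$ and $Y$ be random variables with distribution $p$ and $q$, respectively. The maximum mean discrepancy (MMD) between $p$ and $q$ is

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \; \mathbb{E}[f(X)] - \mathbb{E}[f(Y)]. \qquad (4)$$

If $\mathcal{F}$ is chosen to be an RKHS $\mathcal{H}$, then

$$\mathrm{MMD}^2(\mathcal{H}, p, q) = \| \mu_p - \mu_q \|_{\mathcal{H}}^2, \qquad (5)$$

where $\mu_p \in \mathcal{H}$ is the mean embedding of $p$, given by

$$\mu_p = \int_{\mathbb{X}} k(x, \cdot) \, p(dx) \in \mathcal{H} \qquad (6)$$

and satisfying, for all $f \in \mathcal{H}$,

$$\mathbb{E}[f(X)] = \langle f, \mu_p \rangle_{\mathcal{H}}.$$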

The properties of $\mathrm{MMD}(\mathcal{H}, \cdot, \cdot)$ depend on the underlying RKHS $\mathcal{H}$. For our purposes, it suffices to say that if we take $\mathbb{X}$ to be $\mathbb{R}^d$ and consider the RKHS $\mathcal{H}$ induced by Gaussian or Laplace kernels, then MMD is a metric, and so the minimum of our learning objective is achieved uniquely by $\mathcal{P}$, as desired. (For more details, see Sriperumbudur et al.)
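Eq. (5) can be made concrete with a kernel whose feature map is explicit. The sketch below is an illustration under stated assumptions, not part of the method: it uses the linear kernel $k(x, y) = x \cdot y$ (chosen only because its mean embedding is the ordinary mean vector; it is not one of the Gaussian or Laplace kernels discussed above and does not make MMD a metric), and it evaluates the squared MMD between the two empirical distributions. All function names are hypothetical.

```python
import random

def linear_kernel(u, v):
    """k(u, v) = <u, v>; its feature map is the identity."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(points):
    """Empirical mean embedding under the linear kernel."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mmd2_plugin(xs, ys, kernel):
    """Squared MMD between the empirical measures of xs and ys (all pairs included)."""
    n, m = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def demo(seed=0):
    """Return the kernel expression and the squared distance between mean embeddings."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
    ys = [[rng.gauss(1, 1), rng.gauss(-1, 1)] for _ in range(30)]
    lhs = mmd2_plugin(xs, ys, linear_kernel)
    rhs = squared_distance(mean_vector(xs), mean_vector(ys))
    return lhs, rhs
```

The two returned numbers agree up to floating-point error, which is Eq. (5) specialized to empirical measures and a finite-dimensional feature space.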

In practice, we often do not have access to $p$ or $q$. Instead, we are given independent i.i.d. data $X, X', X_1, \ldots, X_N$ and $Y, Y', Y_1, \ldots, Y_M$ from $p$ and $q$, respectively, and would like to estimate the MMD. Gretton et al. [2] showed that

$$\mathrm{MMD}^2[\mathcal{H}, p, q] = \mathbb{E}\big[k(X, X') - 2k(X, Y) + k(Y, Y')\big] \qquad (7)$$

and then proposed an unbiased estimator

$$\mathrm{MMD}_u^2[\mathcal{H}, X, Y] = \frac{1}{N(N-1)} \sum_{n \neq n'} k(x_n, x_{n'}) + \frac{1}{M(M-1)} \sum_{m \neq m'} k(y_m, y_{m'}) - \frac{2}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} k(x_n, y_m). \qquad (8)$$

Algorithm 1 Stochastic gradient descent for MMD nets.
Initialize M, θ, α, k
Randomly divide training set X into N_mini mini batches
for i ← 1, number-of-iterations do
    Regenerate noise inputs {w_i}_{i=1,...,M} every r iterations
    for n_mini ← 1, N_mini do
        for m ← 1, M do
            y_m ← G_θ(w_m)
        end for
        compute the n'th minibatch's gradient ∇C_n
        update learning rate α (e.g., RMSPROP)
        θ ← θ − α∇C_n
    end for
end for
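The unbiased estimator in Eq. (8) can be sketched in a few lines. This is a minimal illustration, assuming scalar data and an RBF kernel with bandwidth 1.0; the bandwidth, sample sizes, and function names are assumptions made for the example and are not taken from the experiments.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-(x - y) ** 2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, kernel=rbf):
    """Unbiased estimate of MMD^2 as in Eq. (8); diagonal terms are excluded."""
    n, m = len(xs), len(ys)
    kxx = sum(kernel(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    kyy = sum(kernel(ys[i], ys[j]) for i in range(m) for j in range(m) if i != j)
    kxy = sum(kernel(x, y) for x in xs for y in ys)
    return kxx / (n * (n - 1)) + kyy / (m * (m - 1)) - 2.0 * kxy / (n * m)

def demo(seed=0, size=200):
    """Compare a same-distribution pair with a mean-shifted pair."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(size)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(size)]   # same distribution as xs
    zs = [rng.gauss(3.0, 1.0) for _ in range(size)]   # shifted distribution
    return mmd2_unbiased(xs, ys), mmd2_unbiased(xs, zs)
```

Because the estimator is unbiased, it can come out slightly negative when the two samples share a distribution; the shifted pair should give a clearly positive value.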

3 MMD Nets 

With an unbiased estimator of the MMD objective in hand, we can now define our proposal, MMD nets: Fix a neural network $G_\theta$, where $\theta$ represents the parameters of the network. Let $W = (w_1, \ldots, w_M)$ denote noise inputs drawn from $\mathcal{N}$, let $Y_\theta = (y_1, \ldots, y_M)$ with $y_j = G_\theta(w_j)$ denote the noise inputs transformed by the network $G_\theta$, and let $X = (x_1, \ldots, x_N)$ denote the training data in $\mathbb{R}^D$. Given a positive definite kernel $k$ on $\mathbb{R}^D$, we minimize $C(Y_\theta, X)$ as a function of $\theta$, where

$$C(Y_\theta, X) = \frac{1}{M(M-1)} \sum_{m \neq m'} k(y_m, y_{m'}) - \frac{2}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} k(y_m, x_n). \qquad (9)$$

Note that $C(Y_\theta, X)$ comprises only those parts of the unbiased estimator that depend on $\theta$.

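In practice, the minimization is solved by gradient descent, possibly on subsets of the data. More carefully, the chain rule gives us

$$\nabla C(Y_\theta, X) = \frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\partial C_n(Y_\theta, X_n)}{\partial y_m} \frac{\partial G_\theta(w_m)}{\partial \theta}, \qquad (10)$$

where

$$C_n(Y_\theta, X_n) = \frac{1}{M(M-1)} \sum_{m \neq m'} k(y_m, y_{m'}) - \frac{2}{M} \sum_{m=1}^{M} k(y_m, x_n). \qquad (11)$$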

Each derivative $\partial C_n(Y_\theta, X_n) / \partial y_m$ is easily computed for standard kernels like the RBF kernel. Our gradient $\nabla C(Y_\theta, X_n)$ depends on the partial derivatives of the generator with respect to its parameters, which we can compute using back propagation.
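For the RBF kernel the derivative with respect to each generated point has a closed form, since $\partial k(a, b) / \partial a = -\frac{a - b}{s^2} k(a, b)$. The sketch below is an illustration under assumptions: scalar outputs, a bandwidth parameter named `s`, and hypothetical function names. It differentiates the full cost $C(Y_\theta, X)$ of Eq. (9) with respect to each $y_j$; the factor 2 on the first term arises because $y_j$ appears in both arguments of a symmetric kernel.

```python
import math

def rbf(a, b, s=1.0):
    """RBF kernel with bandwidth s on scalars."""
    return math.exp(-(a - b) ** 2 / (2.0 * s * s))

def drbf_da(a, b, s=1.0):
    """Derivative of the RBF kernel with respect to its first argument."""
    return -(a - b) / (s * s) * rbf(a, b, s)

def cost(ys, xs, s=1.0):
    """C(Y, X): the part of the unbiased MMD^2 estimate that depends on Y."""
    M, N = len(ys), len(xs)
    yy = sum(rbf(ys[i], ys[j], s) for i in range(M) for j in range(M) if i != j)
    yx = sum(rbf(y, x, s) for y in ys for x in xs)
    return yy / (M * (M - 1)) - 2.0 * yx / (M * N)

def dcost_dy(ys, xs, s=1.0):
    """Partial derivatives of C(Y, X) with respect to each generated point y_j."""
    M, N = len(ys), len(xs)
    grads = []
    for j in range(M):
        g_yy = sum(drbf_da(ys[j], ys[m], s) for m in range(M) if m != j)
        g_yx = sum(drbf_da(ys[j], x, s) for x in xs)
        grads.append(2.0 * g_yy / (M * (M - 1)) - 2.0 * g_yx / (M * N))
    return grads
```

Multiplying each entry by $\partial G_\theta(w_j) / \partial \theta$ and summing over $j$ gives the parameter gradient; in a neural network that last factor is what back propagation supplies.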

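4 Generalization bounds for MMD

MMD nets operate by minimizing an empirical estimate of the MMD. This estimate is subject to Monte Carlo error and so the network weights (parameters) $\theta$ that are found to minimize the empirical MMD may do a poor job at minimizing the exact population MMD. We show that, for sufficiently large data sets, this estimation error is bounded, despite the space of parameters $\theta$ being continuous and high dimensional.

Let $\Theta$ denote the space of possible parameters for the generator $G_\theta$, let $\mathcal{N}$ be the distribution on $\mathbb{W}$ for the noisy inputs, and let $p_\theta = G_\theta(\mathcal{N})$ be the distribution of $G_\theta(W)$ when $W \sim \mathcal{N}$ for $\theta \in \Theta$. Let $\hat\theta$ be the value optimizing the unbiased empirical MMD estimate, i.e.,

$$\mathrm{MMD}_u^2(\mathcal{H}, X, Y_{\hat\theta}) = \inf_{\theta} \mathrm{MMD}_u^2(\mathcal{H}, X, Y_\theta), \qquad (12)$$

and let $\theta^*$ be the value optimizing the population MMD, i.e.,

$$\mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}) = \inf_{\theta} \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_\theta). \qquad (13)$$

We are interested in bounding the difference

$$\mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\hat\theta}) - \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}). \qquad (14)$$

To that end, for a measured space $\mathbb{X}$, write $L^\infty(\mathbb{X})$ for the space of essentially bounded functions on $\mathbb{X}$ and write $B(L^\infty(\mathbb{X}))$ for the unit ball under the sup norm, i.e.,

$$B(L^\infty(\mathbb{X})) = \{ f : \mathbb{X} \to \mathbb{R} \; : \; (\forall x \in \mathbb{X}) \; f(x) \in [-1, 1] \}.$$

The bounds we obtain will depend on a notion of complexity captured by the fat-shattering dimension:

Definition 1 (Fat-shattering [8]). Let $\underline{X}_N = \{x_1, \ldots, x_N\} \subset \mathbb{X}$ and $\mathcal{F} \subset B(L^\infty(\mathbb{X}))$. For every $\varepsilon > 0$, $\underline{X}_N$ is said to be $\varepsilon$-shattered by $\mathcal{F}$ if there is some function $h : \mathbb{X} \to \mathbb{R}$, such that for every $I \subset \{1, \ldots, N\}$ there is some $f_I \in \mathcal{F}$ for which

$$f_I(x_n) \ge h(x_n) + \varepsilon \quad \text{if } n \in I, \qquad (15)$$

$$f_I(x_n) \le h(x_n) - \varepsilon \quad \text{if } n \notin I. \qquad (16)$$

For every $\varepsilon$, the fat-shattering dimension of $\mathcal{F}$, written $\mathrm{fat}_\varepsilon(\mathcal{F})$, is defined as

$$\mathrm{fat}_\varepsilon(\mathcal{F}) = \sup \{ |\underline{X}_N| \; : \; \underline{X}_N \subset \mathbb{X}, \; \underline{X}_N \text{ is } \varepsilon\text{-shattered by } \mathcal{F} \}.$$

We then have the following bound on the estimation error:

Theorem 1 (estimation error). Assume the kernel is bounded by one. Define

$$\mathcal{G}_{k+} = \{ g = k(G_\theta(w), G_\theta(\cdot)) \; : \; w \in \mathbb{W}, \; \theta \in \Theta \} \qquad (17)$$

and

$$\mathcal{G}^{\mathbb{X}}_{k+} = \{ g = k(x, G_\theta(\cdot)) \; : \; x \in \mathbb{X}, \; \theta \in \Theta \}. \qquad (18)$$

Assume there exist $\gamma_1, \gamma_2 > 1$ and $p_1, p_2 \in \mathbb{N}$ such that, for all $\varepsilon > 0$, it holds that $\mathrm{fat}_\varepsilon(\mathcal{G}_{k+}) \le \gamma_1 \varepsilon^{-p_1}$ and $\mathrm{fat}_\varepsilon(\mathcal{G}^{\mathbb{X}}_{k+}) \le \gamma_2 \varepsilon^{-p_2}$. Then with probability at least $1 - \delta$,

$$\mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\hat\theta}) < \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}) + \varepsilon, \qquad (19)$$

with

$$\varepsilon = r(p_1, \gamma_1, M) + r(p_2, \gamma_2, M - 1) + 12 M^{-1/2} \sqrt{\log \frac{2}{\delta}}, \qquad (20)$$

where the rate $r(p, \gamma, M)$ is

$$r(p, \gamma, M) = C_p \sqrt{\gamma} \begin{cases} M^{-1/2} & \text{if } p < 2, \\ M^{-1/2} \log^{3/2}(M) & \text{if } p = 2, \\ M^{-1/p} & \text{if } p > 2, \end{cases} \qquad (21)$$

for constants $C_{p_1}$ and $C_{p_2}$ depending on $p_1$ and $p_2$ alone.

The proof appears in the appendix. We can obtain simpler, but slightly more restrictive, hypotheses if we bound the fat-shattering dimension of the class of generators $\{G_\theta : \theta \in \Theta\}$ alone: Take the observation space $\mathbb{X}$ to be a bounded subset of a finite-dimensional Euclidean space and the kernel to be Lipschitz continuous and translation invariant. For the RBF kernel, the Lipschitz constant is proportional to the inverse of the length-scale: the resulting bound loosens as the length scale shrinks.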


5 Empirical evaluation 

In this section, we demonstrate the approach on an illustrative synthetic example as well as the standard MNIST digits and Toronto Face Dataset (TFD) benchmarks. We show that MMD-based optimization of the generator rapidly delivers a generator that performs well in maximizing the density of a held-out test set under a kernel-density estimator.

5.1 Gaussian data, kernel, and generator 

Under an RBF kernel and Gaussian generator with parameters $\theta = \{\mu, \sigma\}$, it is straightforward to find the gradient of $C(Y_\theta, X)$ by applying the chain rule. Using fixed random standard normal numbers $\{w_1, \ldots, w_M\}$, we have $y_m = \mu + \sigma w_m$ for $m \in \{1, \ldots, M\}$. The result of these illustrative synthetic experiments can be found in Fig. 1. The dataset consisted of $N = 200$ samples from a standard normal and $M = 50$ noise input samples were generated from a standard normal with a fixed random seed. The algorithm was initialized at values $\{\mu, \sigma\} = \{2.5, 0.1\}$. We fixed the learning rate to 0.5 and ran gradient descent steps for $K = 250$ iterations.
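The synthetic experiment can be approximated with a short script. This is a sketch under assumptions rather than the code behind Fig. 1: the kernel bandwidth (1.0), the random seed, and all function names are choices made here, and the gradient is the analytic RBF gradient of $C(Y_\theta, X)$ pushed through $y_m = \mu + \sigma w_m$ (so $\partial y_m / \partial \mu = 1$ and $\partial y_m / \partial \sigma = w_m$).

```python
import math
import random

def rbf(a, b, bandwidth):
    return math.exp(-(a - b) ** 2 / (2.0 * bandwidth ** 2))

def cost_and_gradient(mu, sigma, ws, xs, bandwidth=1.0):
    """Return C(Y_theta, X) and its partial derivatives in mu and sigma."""
    M, N = len(ws), len(xs)
    ys = [mu + sigma * w for w in ws]
    cost = grad_mu = grad_sigma = 0.0
    for j in range(M):
        dc_dy = 0.0  # partial derivative of the cost with respect to y_j
        for m in range(M):
            if m != j:
                k = rbf(ys[j], ys[m], bandwidth)
                cost += k / (M * (M - 1))
                dc_dy -= 2.0 * (ys[j] - ys[m]) / bandwidth ** 2 * k / (M * (M - 1))
        for x in xs:
            k = rbf(ys[j], x, bandwidth)
            cost -= 2.0 * k / (M * N)
            dc_dy += 2.0 * (ys[j] - x) / bandwidth ** 2 * k / (M * N)
        grad_mu += dc_dy            # dy_j / dmu = 1
        grad_sigma += dc_dy * ws[j]  # dy_j / dsigma = w_j
    return cost, grad_mu, grad_sigma

def train(n_data=200, n_noise=50, mu=2.5, sigma=0.1, lr=0.5, steps=250, seed=0):
    """Plain gradient descent with fixed noise inputs, as in the setup described above."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_data)]
    ws = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
    history = []
    for _ in range(steps):
        c, g_mu, g_sigma = cost_and_gradient(mu, sigma, ws, xs)
        history.append(c)
        mu -= lr * g_mu
        sigma -= lr * g_sigma
    return mu, sigma, history

if __name__ == "__main__":
    mu, sigma, history = train()
    print(mu, sigma, history[0], history[-1])
```

Because the noise inputs are never resampled, the fitted parameters match the finite samples rather than the population, so the final values land near, but not exactly at, $\mu = 0$ and $\sigma = 1$.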

5.2 MNIST digits 

We trained our generative network on the MNIST digits dataset. The generator was chosen to be a fully connected neural network with 3 hidden layers and sigmoidal activation functions. Following Gretton et al. [2], we used a radial basis function (RBF) kernel. We also evaluated the rational quadratic (RQ) kernel [9] and the Laplacian kernel, but found that the RBF performed best in the parameter ranges we evaluated. We used Bayesian optimization (WHETLab) to set the bandwidth of the RBF and the number of neurons in each layer on initial test runs of 50,000 iterations. We used the median heuristic suggested by [2] for the kernel two-sample test to choose the kernel bandwidth. The learning rate was adjusted during optimization by RMSPROP.

Figure 2: (top-left) MNIST digits from the training set. (top-right) Newly generated digits produced after 1,000,000 iterations (approximately 5 hours). Despite the remaining artifacts, the resulting kernel-density estimate of the test data is state of the art. (top-center) Newly generated digits after 300 further iterations optimizing the associated empirical MMD. (bottom-left) MMD learning curves for first 2000 iterations. (bottom-right) MMD learning curves from 2000 to 500,000 iterations. Note the difference in y-axis scale. No appreciable change is seen in later iterations.
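The median heuristic mentioned in this section sets the kernel bandwidth to the median pairwise distance between sample points. The sketch below shows one common form of it; which points are pooled (training data only, or training and generated points together) is an assumption left to the caller, and the function name is hypothetical.

```python
import math
import statistics

def median_heuristic(points):
    """Median Euclidean distance over all unordered pairs of distinct points."""
    distances = [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return statistics.median(distances)

if __name__ == "__main__":
    sample = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    print(median_heuristic(sample))
```

The cost is quadratic in the number of points, so on a dataset the size of MNIST one would typically compute it on a random subsample.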

Fig. 2 presents the digits learned after 1,000,000 iterations. We performed minibatch stochastic gradient descent, resampling the generated digits every 300 iterations, using minibatches of size 500, with equal numbers of training and generated points. It is clear that the digits produced have many artifacts not appearing in the MNIST data set. Despite this, the mean log density of the held-out test data is 315 ± 2, as compared with the reported 225 ± 2 mean log density achieved by adversarial nets.

There are several possible explanations for this. First, kernel density estimation is known to perform poorly in high dimensions. Second, the MMD objective can itself be understood as the squared difference of two kernel density estimates, and so, in a sense, the objective being optimized is directly related to the subsequent mean test log density evaluation. There is no clear connection for adversarial networks, which might explain why they suffer under this test. Our experience suggests that the RBF kernel delivers baseline performance but that an image-specific kernel, capturing, e.g., shift invariance, might lead to better images.
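The evaluation metric discussed here, the mean log density of held-out points under a kernel density estimate fitted to generated samples, can be sketched as follows. This is an illustrative Gaussian Parzen-window implementation with hypothetical names; the bandwidth is a free parameter here, and how it was chosen for the reported numbers is not specified in this section.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def parzen_mean_log_density(test_points, samples, bandwidth):
    """Mean log density of test_points under a Gaussian KDE centred on samples."""
    dim = len(samples[0])
    # Log of the Gaussian normalizer times the uniform mixture weight 1 / len(samples).
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2) - math.log(len(samples))
    total = 0.0
    for x in test_points:
        exponents = [
            -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * bandwidth ** 2)
            for s in samples
        ]
        total += log_norm + log_sum_exp(exponents)
    return total / len(test_points)
```

The normalizer scales with the dimension, so in high dimensions a small bandwidth can produce large positive log densities; this is one reason the absolute values are hard to interpret on image data.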


5.3 Toronto face dataset 

We have also trained the generative MMD network on the Toronto Face Dataset (TFD) [13]. The parameters were adapted from the MNIST experiment: we also used a 3-hidden-layer sigmoidal MLP with similar architecture (1000, 600, and 1000 units) and an RBF kernel for the cost function with the same hyperparameter. The training dataset batch sizes were equal to the number of generated points (500). The generated points were resampled every 500 iterations. The network was optimized for 500,000 iterations.

The samples from the resulting network are plotted in Fig. 3. The mean log density of the held-out test set is 2283 ± 39. Although this figure is higher than the mean log density of 2057 ± 26 reported in [1], the samples from the MMD network are again clearly distinguishable from the training dataset. Thus the high test score suggests that kernel density estimation does not perform well at evaluating the performance for these high-dimensional datasets.

6 Conclusion 

MMD offers a closed-form surrogate for the discriminator in the adversarial nets framework. After using Bayesian optimization for the parameters, we found that the network outperformed the adversarial network in terms of the density of the held-out test set under kernel density estimation. On the other hand, there is a clear discrepancy between the digits produced by MMD Nets and the MNIST digits, which might suggest that KDE is not up to the task of evaluating these models. Given how quickly MMD Nets achieves this level of performance, it is worth considering its use as an initialization for more costly procedures.

Figure 3: (left) TFD. (center) Generated points after 500 iterations. (right) Faces generated by network trained for 500,000 iterations.
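A Proofs

We begin with some preliminaries and known results:

Definition 2 ([8]). A random variable $\sigma$ is said to be a Rademacher random variable if it takes values in $\{-1, 1\}$, each with probability $1/2$.

Definition 3. Let $\mu$ be a probability measure on $\mathbb{X}$, and let $\mathcal{F}$ be a class of uniformly bounded functions on $\mathbb{X}$. Then the Rademacher complexity of $\mathcal{F}$ (with respect to $\mu$) is

$$R_N(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{N}} \left| \sum_{n=1}^{N} \sigma_n f(X_n) \right|,$$

where $\sigma = (\sigma_1, \sigma_2, \ldots)$ is a sequence of independent Rademacher random variables, and $X_1, X_2, \ldots$ are independent, $\mu$-distributed random variables, independent also from $\sigma$.

Theorem 2 (McDiarmid's Inequality). Let $f : \mathbb{X}_1 \times \cdots \times \mathbb{X}_N \to \mathbb{R}$ and assume there exist $c_1, \ldots, c_N \ge 0$ such that, for all $k \in \{1, \ldots, N\}$, we have

$$\sup_{x_1, \ldots, x_k, x_k', \ldots, x_N} \left| f(x_1, \ldots, x_k, \ldots, x_N) - f(x_1, \ldots, x_k', \ldots, x_N) \right| \le c_k.$$

Then, for all $\varepsilon > 0$ and independent random variables $\xi_1, \ldots, \xi_N$ in $\mathbb{X}$,

$$\Pr\left\{ f(\xi_1, \ldots, \xi_N) - \mathbb{E}(f(\xi_1, \ldots, \xi_N)) \ge \varepsilon \right\} < \exp\left( \frac{-2\varepsilon^2}{\sum_{n=1}^{N} c_n^2} \right).$$

Theorem 3 ([8, Thm. 2.35]). Let $\mathcal{F} \subset B(L^\infty(\mathbb{X}))$. Assume there exists $\gamma > 1$, such that for all $\varepsilon > 0$, $\mathrm{fat}_\varepsilon(\mathcal{F}) \le \gamma \varepsilon^{-p}$ for some $p \in \mathbb{N}$. Then there exist constants $C_p$ depending on $p$ only, such that

$$R_N(\mathcal{F}) \le C_p \gamma^{1/2} \begin{cases} 1 & \text{if } 0 < p < 2, \\ \log^{3/2} N & \text{if } p = 2, \\ N^{1/2 - 1/p} & \text{if } p > 2. \end{cases} \qquad (22)$$

Theorem 4 ([2]). Assume $0 \le k(x_i, x_j) \le K$, $M = N$. Then

$$\Pr\left[ \left| \mathrm{MMD}_u^2(\mathcal{H}, X, Y_\theta) - \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_\theta) \right| > \varepsilon \right] \le \delta_\varepsilon, \qquad (23)$$

where

$$\delta_\varepsilon = 2 \exp\left( -\frac{\varepsilon^2 M}{16 K^2} \right). \qquad (24)$$

The case where $\Theta$ is a finite set is elementary:

Theorem 5 (estimation error for finite parameter set). Let $p_\theta$ be the distribution of $G_\theta(W)$, with $\theta$ taking values in some finite set $\Theta = \{\theta_1, \ldots, \theta_T\}$, $T < \infty$. Then, with probability at least $1 - (T + 1)\delta_\varepsilon$, where $\delta_\varepsilon$ is defined as in Theorem 4, we have

$$\mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\hat\theta}) < \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}) + 2\varepsilon. \qquad (25)$$

Proof. Let

$$\mathcal{E}(\theta) = \mathrm{MMD}_u^2(\mathcal{H}, X, Y_\theta) \qquad (26)$$

and

$$\mathcal{T}(\theta) = \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_\theta). \qquad (27)$$

Note that the upper bound stated in Theorem 4 holds for the parameter value $\theta^*$, i.e.,

$$\Pr\left[ |\mathcal{E}(\theta^*) - \mathcal{T}(\theta^*)| > \varepsilon \right] \le \delta_\varepsilon. \qquad (28)$$

Because $\hat\theta$ depends on the training data $X$ and generator data $Y$, we use a uniform bound that holds over all $\theta$. Specifically,

$$\Pr\left[ |\mathcal{E}(\hat\theta) - \mathcal{T}(\hat\theta)| > \varepsilon \right] \le \Pr\left[ \sup_{\theta} |\mathcal{E}(\theta) - \mathcal{T}(\theta)| > \varepsilon \right] \le \sum_{t=1}^{T} \Pr\left[ |\mathcal{E}(\theta_t) - \mathcal{T}(\theta_t)| > \varepsilon \right] \le T \delta_\varepsilon.$$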

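This yields that with probability at least $1 - T\delta_\varepsilon$,

$$2\varepsilon \ge |\mathcal{E}(\hat\theta) - \mathcal{T}(\hat\theta)| + |\mathcal{E}(\theta^*) - \mathcal{T}(\theta^*)| \qquad (29)$$

$$\ge |\mathcal{E}(\theta^*) - \mathcal{E}(\hat\theta) + \mathcal{T}(\hat\theta) - \mathcal{T}(\theta^*)|. \qquad (30)$$

Since $\theta^*$ was chosen to minimize $\mathcal{T}(\theta)$, we know that

$$\mathcal{T}(\hat\theta) \ge \mathcal{T}(\theta^*). \qquad (31)$$

Similarly, by Eq. (12),

$$\mathcal{E}(\theta^*) \ge \mathcal{E}(\hat\theta). \qquad (32)$$

Therefore it follows that

$$2\varepsilon \ge \mathcal{T}(\hat\theta) - \mathcal{T}(\theta^*) \qquad (33)$$

$$= \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\hat\theta}) - \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}), \qquad (34)$$

proving the theorem. □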

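Corollary 1. With probability at least $1 - \delta$,

$$\mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\hat\theta}) < \mathrm{MMD}^2(\mathcal{H}, p_{\mathrm{data}}, p_{\theta^*}) + 2\varepsilon_\delta,$$

where

$$\varepsilon_\delta = 8K \sqrt{\frac{1}{M} \log\left[ \frac{2(T + 1)}{\delta} \right]}.$$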


In order to prove the general result, we begin with some technical lemmas. The development here owes much to Gretton et al.

Lemma 1. Let T = {/ : y x y —>> R} 

= {h = /(2/, s(Loo(y)). 

{E n }^ =1 p-distributed independent random variables 
in y. Assume for some 7 > 1 p G N, 

fat e (T+) < 75 _:p , /or a// 5 > 0. For y n G J 7 Vn = 1,... 5 TV, 
de/dze p(2/i,...,2/jv) to fee 


sup 


E(f(Y,Y')) - N J_ 1} E /(yn.yn') 


n^n' 


Then there exists a constant C that depends on p, such that 


E (p(Yi y ..., Y n )) < C 72 ^ 


1 


/AT^T 


^ i/ p = 2 

1 


(N-1)p 


ifp < 2 

ifp = 2 
ifp > 2 . 


Proof Let us introduce {Cn}n=i- where Q n and Y n / have 
the same distribution and are independent for all n,n' £ 
{1,..., iV}. Then the following is true: 

E ( / ( y,r)) = E(— 1— £ 


n,n': n^n' 


Using Jensen’s inequality and the independence of Y, Y' and 
Y n , Y n t , we have 


E(p(Y u ...,Y N )) 

= e( sup E(f(Y, Y')) 
\feT 


< E[ sup 
V feJ 7 


E f( Y m,Ym') 

n^n' 

T _ f(CrnCn) 

n^n' 

E /(^,FnO 

n^n' 


N(N - 1) 

1 

N(N~ 1) 

iV(iV - 1) 


(35) 


(36) 


(37) 


Introducing conditional expectations allows us to rewrite the 
equation with the sum over n outside the expectations. I.e., 
Eq. ([361 equals to 


^Y E ( EiYn ’ <n) ( Sn p\wZ I E (/(Cn.CnO - f{Y n ,Y n .)) 

n ' f eJr n'^ri 

(38) 

/ 1 iV-l X 

=E E™ ( sup | E M/(C, Cn) - f(Y, Y n )) ) 

' f eJr 1 n=l ' 

(39) 

The second equality follows by symmetry of random variables 
{Cn}n=l- Note that we also added Rademacher random vari¬ 
ables {cJ n }nEi before each term in the sum since (/(£ n , Cn') ~ 
f(Y n ,Y n /)) has the same distribution as ~{f{CnXn') ~ 
f(Y n , Y n f)) for all n, n r and therefore the cr’s do not affect the 
expectation of the sum. 

Note that Cm and Y rn are identically distributed. Thus the trian¬ 
gle inequality implies that Eq. p9| is less than or equal to 


n= 1 


< 


1 


Rn-i(E+), 


where i//v-i(d-+) is the Rademacher’s complexity of T+. 
Then by Theorem [3] we have 


E(p(Y 1 ,...,Y N ))<C'y* 


/N=l 


log 3 (AT-l) 

N—l 

1 


k (N—l) p 


if p < 2 

if p = 2 (40) 

if p > 2 . 

□ 


Lemma 2. Let T = {/ : X x y -y R} and 
J ~= {/ \ x x y —> M, x G 


(41) 


C Bf^L^Cy)). Let {X n }^ = i and ^ 1 /- 

\i-distributed independent random variables in X and y, re¬ 
spectively. Assume for some 7 > 1, such that for all e > 0, 
fat £ (J-+) < 75 p , for some p G N. For all x n G X, n < N, 
and all Pm e y, m < M, define 

p(x 1 ,... ,xn,Vi, • • • 1 Dm ) = 


sup 

/G.F 




Then there exists C that depends on p, such that 
E(p(X 1 ,...,X N ,Y 1 ,...,Y M )) 

Vm if p < 2 

ifP = 2 

*/P > 2. 




Proof. The proof is very similar to that of Lemma j] 


□ 


Proof of Theorem 1. The proof follows the same steps as the proof of Theorem 5 apart from a stronger uniform bound stated in Eq. (29). I.e., we need to show:

$$\Pr\left[ \sup_{\theta \in \Theta} |\mathcal{E}(\theta) - \mathcal{T}(\theta)| > \varepsilon \right] \le \delta. \qquad (42)$$


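Expanding MMD as defined by Eqs. (7) and (8), and substituting $Y = G_\theta(W)$, yields

$$\sup_{\theta \in \Theta} |\mathcal{E}(\theta) - \mathcal{T}(\theta)| \qquad (43)$$

$$= \sup_{\theta \in \Theta} \Bigg| \, \mathbb{E}(k(X, X')) - \frac{1}{N(N-1)} \sum_{n' \neq n} k(X_n, X_{n'}) + \mathbb{E}(k(G_\theta(W), G_\theta(W'))) - \frac{1}{M(M-1)} \sum_{m \neq m'} k(G_\theta(W_m), G_\theta(W_{m'})) - 2\,\mathbb{E}(k(X, G_\theta(W))) + \frac{2}{MN} \sum_{m, n} k(X_n, G_\theta(W_m)) \, \Bigg|. \qquad (44)$$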


For all $n \in \{1, \ldots, N\}$, $k(X_n, X_{n'})$ does not depend on $\theta$ and therefore the first two terms of the equation above can be taken out of the supremum. Also, note that since $|k(\cdot, \cdot)| \le K$, we have

$$\left| \zeta(x_1, \ldots, x_n, \ldots, x_N) - \zeta(x_1, \ldots, x_n', \ldots, x_N) \right| \le \frac{2K}{N},$$

where

$$\zeta(x_1, \ldots, x_N) = \frac{1}{N(N-1)} \sum_{n, n' : \, n' \neq n} k(x_n, x_{n'}),$$

and $\zeta$ is an unbiased estimate of $\mathbb{E}(k(X, X'))$. Then from McDiarmid's inequality on $\zeta$, we have

$$\Pr\left( \left| \mathbb{E}(k(X, X')) - \frac{1}{N(N-1)} \sum_{n' \neq n} k(X_n, X_{n'}) \right| > \varepsilon \right) \le \exp\left( -\frac{\varepsilon^2 N}{2K^2} \right). \qquad (45)$$

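Therefore Eq. (44) is bounded by the sum of the bound on Eq. (45) and the following:

$$\sup_{\theta \in \Theta} \Bigg| \, \mathbb{E}(k(G_\theta(W), G_\theta(W'))) - \frac{1}{M(M-1)} \sum_{m \neq m'} k(G_\theta(W_m), G_\theta(W_{m'})) - 2\,\mathbb{E}(k(X, G_\theta(W))) + \frac{2}{MN} \sum_{m, n} k(X_n, G_\theta(W_m)) \, \Bigg|. \qquad (46)$$

Thus the next step is to find the bound for the supremum above. Define

$$f(W_1, \ldots, W_M; p_{\mathrm{noise}}) = f(\underline{W}_M) = \sup_{\theta \in \Theta} \left| \mathbb{E}(k(G_\theta(W), G_\theta(W'))) - \frac{1}{M(M-1)} \sum_{m \neq m'} k(G_\theta(W_m), G_\theta(W_{m'})) \right|$$

and

$$h(X_1, \ldots, X_N, W_1, \ldots, W_M; p_{\mathrm{data}}, p_{\mathrm{noise}}) = h(\underline{X}_N, \underline{W}_M) = \sup_{\theta \in \Theta} \left| \frac{1}{MN} \sum_{m, n} k(X_n, G_\theta(W_m)) - \mathbb{E}(k(X, G_\theta(W))) \right|.$$

Then by the triangle inequality

$$\text{Eq. (46)} \le f(\underline{W}_M) + 2 h(\underline{X}_N, \underline{W}_M). \qquad (47)$$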

We will first find the upper bound on f(W M ), i.e., for every 
6 > 0 , we will show that there exists Sf , such that 


Pr (f(W M )>s)<6 f 
For each m G {!,•••, M}, 


(48) 


(49) 

<-% « 


since the kernel is bounded by K , and therefore 
k(Ge{W rn ) : Ge{W rn ')) is bounded by K for all m. The 
conditions of Theorem |2 are satisfied and thus we can use 
McDiarmids Inequality on /: 


Pr (f(W M ) - E(f(W M )) > e) < exp - 


e 2 M 

2K 2 


Define 


&«{fc(G*(-),<?*(■)): 0G©} 


(51) 


(52) 


To show Eq. ( |48| >, we need to bound the expectation of /. We 
can apply Lemma 1 on the function classes Qk and Gk+- The 
resulting bound is 


E{f{W M )) < e pl = C n ? { 


1 


XMf -1 


, (M—1) pi 


if Pi < 2 

if Pi = 2 
if pi > 2 . 


log 3 (M —1) if _ 2 
M—l U pi — Z , 

1 


(53) 

where $p_1$ and $\gamma_1$ are parameters associated to fat shattering dimension of $\mathcal{G}_{k+}$ as stated in the assumptions of the theorem, and $C_f$ is a constant depending on $p_1$.


Now we can write down the bound on 


/ e 2 M\ 

Pr > £ pi + <0 < exp = s f- ( 54 > 


Similarly, h(X N , W M ) has bounded differences: 

-h(X 1 ,...,X n ,,...,X N ,W u ...,W M ) 


< 


2 K 

W 


(55) 


and 

-h(X l9 ...,X N ,W u ...,W m ',...,W M ) 

McDiarmid’s inequality then implies 

Pr (h(X N ,W M )-E(h(X Nl W M )>e) 


< 


2 K 
~M' 


(56) 


< exp 


s 2 NM 

~2K 2 JvTm) ' 


(57) 


We can bound expectation of h{X^ N ,W_ M ) using Lemma 2 ap¬ 
plied on Qf and G ~{: + , where 


gf = {k(;G e (-)):6eQ}. 


Then 


E(h(X N , W_ M f) — e P2 — Chli < 


1 


yf 


'log 3 (M) 
M 

1 

-T~ 

M p 2 


(58) 

ifp2 < 2 

if P 2 = 2 
if P 2 > 2 . 

(59) 


for some constant Ch that depends on p@. The final bound on h 
is then 


Pr (h(X N ,W M ) > e P2 + e) 


z , £ 2 NM ( 

s “ p| -27pwTm I=, s *' 


(60) 


Summing up the bounds from Eq. ( [54| and Eq. ( [57] ), it follows 
that 


Pr (f(W_ M ) + 2 h(2L N ,W^ M ) — £ pi + 2s P2 + 35 ) 

< max(£/,4) = 4. 


(61) 


Using the bound in Eq. ( f45| , we have obtain the uniform bound 
we were looking for: 


Pr 


sup | £{6) ~T{9) | > e Pl + 2s P2 + 4e 


_6>e© 

which by Eq. ( 29) yields 


< 5h j (62) 


Pr |£(0) — T(0)| > 5 Pl + 25 P2 +45 < (*+. 


Since it was assumed that K — 1 and TV = M, we get 

- / 5 2 M N 

d/i = exp-- 


(64) 


To finish, we proceed as in the proof of Theorem 5] We can 
rearrange some of the terms to get a different form of Eq. ( 28): 


Pr[|£(<9*) -T(0*)| > 25] 

5 2 M 


< 2 exp — 


= 26 h . 


All of the above implies that for any 5 > 0, there exists 5, such 
that 

Pr (MMD 2 {%, p data , p § ) 


- MMD 2 (7f, p d ata? Po *) > e) < <5, 


where 


12 A 2 

£ = £ „+2 £ „ + ^=^o gj . 


(65) 


We can rewrite 5 as: 


£ = r(pi, 7 i,M) + r(p 2 , 72, M - 1) + 12 M 2 ^/log 


( 66 ) 


The rate r(p, 7, IV) is given by Eq. (53) and Eq. (59): 

r (p, 7; M) = C p ^< 


M 2 if p < 2, 

M-Mog5(M) if p = 2, (67) 

if p > 2, 


where the constants C p \ and C p 2 depend on pi and P 2 alone 


□ 


We close by noting that the approximation error is zero in the nonparametric limit.

Theorem 6 (Gretton et al. [2]). Let $\mathcal{F}$ be the unit ball in a universal RKHS $\mathcal{H}$, defined on the compact metric space $\mathbb{X}$, with associated continuous kernel $k(\cdot, \cdot)$. Then $\mathrm{MMD}[\mathcal{H}, p, q] = 0$ if and only if $p = q$.

Corollary 2 (approximation error). Assume $p_{\mathrm{data}}$ is in the family $\{p_\theta\}$ and that $\mathcal{H}$ is an RKHS induced by a characteristic kernel. Then

$$\inf_{\theta} \mathrm{MMD}(\mathcal{H}, p_{\mathrm{data}}, p_\theta) = 0 \qquad (68)$$

and the infimum is achieved at $\theta$ satisfying $p_\theta = p_{\mathrm{data}}$.

Proof. By Theorem 6, it follows that $\mathrm{MMD}^2(\mathcal{H}, \cdot, \cdot)$ is a metric. The result is then immediate. □


(63) 